

### Multimedia System-on-Chip Design with Specialization in Application Acceleration with High-Level Synthesis



#### <u> 領瑾 Jiin Lai</u>

Founder, CTO, VIA Technologies Inc.

- Bachelor's degree in Electrical Engineering, National Taiwan University, 1983
- Master of Science in Computer Engineering, University of Texas at Austin, 1987

Jiin Lai was the Chief Technology Officer of VIA Technologies. He has over 30 years of experience in the PC industry, including the past 12 years in storage. Early in his career, he was a software engineer developing EDA tools. He later co-founded VIA Technologies, where he led the engineering team that developed Intel- and AMD-compatible chipsets and x86-compatible processors. In the past decade, he developed SSD controllers and later shifted his focus to distributed computational storage systems. His responsibilities included product and architecture development, with an eye toward future computing architecture needs. Mr. Lai holds over 50 US patents.

### Topics

- Objectives of HLS
- HLS Course Logistics



# **High-Level Statements**

- Computational power hits a plateau
- Hardware accelerators to the rescue



(Source: Accenture analysis)



- The rise of the custom accelerator marketplace



### FPGA Development Made Easy

- Hardware description languages are low-level and very difficult
- Use C, C++, OpenCL, Python, or TensorFlow
  - No hardware design knowledge required
  - Parallel programming concepts apply
- How a software application interacts with the FPGA
- Off-the-shelf platforms are ready



#### Use C, C++ to program FPGAs

```
// 2D convolution with an 11x11 window
for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    window = in_stream.read();   // 11x11 neighborhood centered on (x, y)
    sum = 0;
    for (int row = 0; row < 11; row++) {
      for (int col = 0; col < 11; col++) {
        // Skip taps that fall outside the image
        if (index_in_range(x + col - 5, y + row - 5, width, height)) {
          sum += window.val[row][col] * coeffs[row][col];
        }
      }
    }
    out_stream.write(sum);       // one output pixel per window
  }
}
```
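The windowed loop above can be checked against a plain software model. The following is a minimal sketch, assuming a row-major integer image and an odd window size `K`; the function name `conv2d` and the body of `index_in_range` are illustrative assumptions, not code from the course material.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: true when (x, y) lies inside a width x height image.
static bool index_in_range(int x, int y, int width, int height) {
    return x >= 0 && x < width && y >= 0 && y < height;
}

// Reference K x K windowed sum of products. Taps that fall outside the image
// contribute nothing. `in` and the result are row-major width*height images;
// `coeffs` is a row-major K*K array.
std::vector<int> conv2d(const std::vector<int>& in, int width, int height,
                        const std::vector<int>& coeffs, int K) {
    assert(static_cast<int>(in.size()) == width * height);
    assert(static_cast<int>(coeffs.size()) == K * K && K % 2 == 1);
    const int half = K / 2;
    std::vector<int> out(in.size(), 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int row = 0; row < K; row++) {
                for (int col = 0; col < K; col++) {
                    const int sx = x + col - half;  // source column
                    const int sy = y + row - half;  // source row
                    if (index_in_range(sx, sy, width, height)) {
                        sum += in[sy * width + sx] * coeffs[row * K + col];
                    }
                }
            }
            out[y * width + x] = sum;  // one output pixel per window
        }
    }
    return out;
}
```

A software model like this is typically useful as the golden reference in a C-simulation testbench, where its output is compared against the streaming kernel.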









http://www.ecs.umass.edu/ece/labs/vlsicad/ece667/reading/hls-survey.pdf

### Think "Parallel"

- Data-Level Parallelism
- Task-Level Parallelism
- Instruction (Operator)-Level Parallelism

```
for (int i = 0; i < N; i++) {
  acc += A[i] * B[i];
}
```
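To make operator-level parallelism concrete, here is a minimal sketch of the multiply-accumulate loop rewritten with four independent partial sums. The function names and the unroll factor of 4 are assumptions chosen for illustration. In an HLS tool, a directive such as `#pragma HLS UNROLL` would typically request this kind of replication rather than a manual rewrite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: each iteration depends on the previous value of acc,
// so the multiply-accumulates form one serial chain.
long dot_serial(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long acc = 0;
    for (std::size_t i = 0; i < a.size(); i++) {
        acc += static_cast<long>(a[i]) * b[i];
    }
    return acc;
}

// Four independent partial sums: the four multiply-accumulates in the inner
// loop have no dependence on each other, so hardware can run them side by side.
long dot_unrolled4(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long part[4] = {0, 0, 0, 0};
    const std::size_t n = a.size();
    const std::size_t main_n = n - n % 4;  // largest multiple of 4 <= n
    for (std::size_t i = 0; i < main_n; i += 4) {
        for (std::size_t lane = 0; lane < 4; lane++) {
            part[lane] += static_cast<long>(a[i + lane]) * b[i + lane];
        }
    }
    // Combine the partial sums as a small adder tree, then handle leftovers.
    long acc = (part[0] + part[1]) + (part[2] + part[3]);
    for (std::size_t i = main_n; i < n; i++) {
        acc += static_cast<long>(a[i]) * b[i];
    }
    return acc;
}
```

With integer data both versions return the same result. With floating-point data the reordered additions may differ slightly in rounding, which is one reason tools do not always apply this transformation automatically.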

(Figure: data parallelism vs. task parallelism. Program instances 0 through m-1 each process their own data items, labeled data (i, 0) through data (i, n-1).)
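The two forms of parallelism can be sketched in software as follows. This is a minimal model under stated assumptions: `Fifo` is a `std::queue` standing in for a hardware FIFO, and the stage names `load`, `scale`, and `store` are hypothetical. In software the stages run one after another; in hardware, a dataflow-style directive (for example `#pragma HLS DATAFLOW` in Xilinx tools) would typically let them overlap.

```cpp
#include <cassert>
#include <queue>
#include <vector>

using Fifo = std::queue<int>;  // software stand-in for a hardware FIFO

// Task-level parallelism: three stages connected by FIFOs.
void load(const std::vector<int>& src, Fifo& out) {
    for (int v : src) out.push(v);
}

void scale(Fifo& in, Fifo& out, int factor) {
    while (!in.empty()) {
        out.push(in.front() * factor);
        in.pop();
    }
}

void store(Fifo& in, std::vector<int>& dst) {
    while (!in.empty()) {
        dst.push_back(in.front());
        in.pop();
    }
}

// One "program instance": the load -> scale -> store pipeline over one row.
std::vector<int> run_pipeline(const std::vector<int>& src, int factor) {
    Fifo a, b;
    std::vector<int> dst;
    load(src, a);
    scale(a, b, factor);
    store(b, dst);
    assert(a.empty() && b.empty());  // every item passed through all stages
    return dst;
}

// Data-level parallelism: the same pipeline applied to independent rows.
// The rows share no state, so each could map to its own hardware instance.
std::vector<std::vector<int>> run_data_parallel(
        const std::vector<std::vector<int>>& rows, int factor) {
    std::vector<std::vector<int>> out;
    for (const auto& row : rows) {
        out.push_back(run_pipeline(row, factor));
    }
    return out;
}
```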

#### Software Interacts with FPGA





#### **Speed Up Development with Libraries**

#### Use Extensive, Open Source Libraries



400+ functions across multiple libraries for performance-optimized out-of-the-box acceleration







### Example of an Oil and Gas Workload

#### Productivity


- Not the traditional programming model for FPGAs:
  - One software engineer with no previous O&G experience took one month to describe and implement the entire RTM algorithm in C++
  - No optimized library calls; completely described in C++
  - Fewer than 500 lines of code and fewer than 50 pragmas
- Standard language, open source tools and libraries

#### Seismic Method for Oil and Gas industry

Seismic Imaging Technology

- Seismic survey: acoustic wave sampling
- Seismic imaging: mathematically process the wave traces to create an image
- RTM (Reverse Time Migration)
  - High-fidelity algorithm for imaging complex subsurface structures
  - Cross-correlation between source wavefield and receiver wavefield
  - Wavefield reconstruction from saved boundaries
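The cross-correlation step can be sketched as follows. This is a minimal illustration, assuming the commonly used zero-lag imaging condition (multiply the two wavefields at each time step and sum over time), a one-dimensional spatial grid, and wavefields that are already time-aligned. The type `Wavefield` and the function name are hypothetical; wave propagation and boundary-based reconstruction are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// wavefield[t][x]: amplitude at time step t and grid point x (assumed layout).
using Wavefield = std::vector<std::vector<float>>;

// Zero-lag cross-correlation: image[x] = sum over t of source[t][x] * receiver[t][x].
std::vector<float> cross_correlate_image(const Wavefield& source,
                                         const Wavefield& receiver) {
    assert(source.size() == receiver.size());
    const std::size_t nt = source.size();
    const std::size_t nx = (nt > 0) ? source[0].size() : 0;
    std::vector<float> image(nx, 0.0f);
    for (std::size_t t = 0; t < nt; t++) {
        assert(source[t].size() == nx && receiver[t].size() == nx);
        for (std::size_t x = 0; x < nx; x++) {
            image[x] += source[t][x] * receiver[t][x];
        }
    }
    return image;
}
```

The inner loop over `x` has no dependence between grid points, which is the kind of structure that tends to map well onto parallel hardware.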







#### Message Intelligence Appliance Cortical.io

- Semantic Supercomputing for NLU (Natural Language Understanding)
- Automatically classifies message based on semantics/meaning of the content

#### **Semantic Folding Explained**

Words, sentences, and paragraphs are represented by semantic fingerprints

- Each word is represented by 16K binary contexts in a 2D vector
- All operations are **binary**
- Minimal source material required: reference material, textbooks, data sheets, emails, etc.
- Creation of the semantic fingerprints is completely unsupervised
- All meanings of a word are represented
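Because fingerprints are binary, comparing two of them reduces to bitwise operations and bit counting. The sketch below is an illustration under assumptions: 16K is taken as 16384 bits (for example a 128 x 128 grid), and overlap count and Jaccard ratio are used as stand-in similarity measures, since the slides do not state which measure the product uses. All names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

constexpr std::size_t kBits = 16384;  // "16K" positions (assumed 128 x 128)
using Fingerprint = std::bitset<kBits>;

// Build a fingerprint from a list of active bit positions.
Fingerprint make_fingerprint(std::initializer_list<std::size_t> active) {
    Fingerprint fp;
    for (std::size_t pos : active) {
        assert(pos < kBits);
        fp.set(pos);
    }
    return fp;
}

// Number of positions active in both fingerprints: one AND plus a popcount.
std::size_t overlap(const Fingerprint& a, const Fingerprint& b) {
    return (a & b).count();
}

// Shared active bits divided by total active bits (Jaccard ratio).
double jaccard(const Fingerprint& a, const Fingerprint& b) {
    const std::size_t uni = (a | b).count();
    if (uni == 0) return 0.0;
    return static_cast<double>(overlap(a, b)) / static_cast<double>(uni);
}
```

AND, OR, and popcount over wide bit vectors are inexpensive in FPGA logic, which is consistent with the slide's emphasis on binary operations.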







### Why HLS?

- Productivity (Design and Verification)
- IP Reuse
- Better QoR
- End-to-end application acceleration by software designer
- For academics: a great tool and skill for research





# Empower software designers to develop efficient application accelerators



### Course Contents

- Course Texts:
  - R. Kastner, Parallel Programming for FPGAs, arXiv, 2018
  - Xilinx ug902
- Supplementary Materials:
  - Reference Papers
  - Manual/Datasheets
- Lecture ppt & video: 16 sessions
- Labs: ~200 optional lab references
- In-class presentations: 5 sessions + final project presentation
- Final project & presentation



# Logistics

- Off-class lecture & lab/assignment
  - Lecture is self-paced
  - Lab/assignment is self-paced with lab-work submission
- In-class presentation & discussion
  - Sign-up by Google-Form (submit by Thursday 3pm)
  - Presentation selected based on available time slots, weight, and submission time.
  - Refer to "HLS Course Plan.doc"



### In-class schedule and subjects

#### https://cool.ntu.edu.tw/courses/3773/modules/items/110288

| Session | Date   | Suggested lecture title (self-paced) | pdf | video | In-class discussion topics          | Assignment                          |
|---------|--------|--------------------------------------|-----|-------|-------------------------------------|-------------------------------------|
| 1       | 18-Sep | Course Introduction                  | §   | §     |                                     | HLS flow                            |
|         |        | Introduction PYNQ & Lab2             | §   | §     |                                     | ug871 labs ug871-[1:7]              |
|         |        | Vitis OpenCL XRT and Lab3            | §   | §     |                                     | Lab1: Tool Installation             |
|         |        | Introduction to FPGA                 |     |       |                                     | Lab2: PYNQ axi-m & stream           |
|         |        |                                      |     |       |                                     | Lab3: OpenCL/XRT                    |
| 2       | 16-Oct | Kernel IO Interface                  | §   | §     | ug871 Labs ug871-[1:7]              | Xilinx Training Lab - xtrain-[1:13] |
|         |        | Introduction to High Level Synthesis | §   | §     | HLS, Vivado, Vitis usage experience | Xilinx HLS Coding Style             |
|         |        | FPGA - CLB                           | §   | §     | Lab1, Lab2, Lab3 sharing            | Xilinx HLS Design - xdesign-[1:15]  |
|         |        | FPGA - Memory                        | §   | §     |                                     |                                     |
| 3       | 30-Oct | System Optimization - Host           | §   | §     | Xilinx Training Lab - xtrain-[1:13] | Vitis Tutorial - vitis-[1:7]        |
|         |        | System Optimization - Kernel         | §   | §     | Xilinx HLS Design - xdesign-[1:15]  | UCSD Lab ucsd-[1:5]                 |
|         |        | FPGA - DSP                           | §   | §     |                                     | Cornell - ECE5775 cornell-[1:4]     |
|         |        | FPGA - Interconnect                  | §   | §     |                                     |                                     |
| 4       | 13-Nov | Kernel Optimization - Area           | §   | §     | Vitis Tutorial - vitis-[1:7]        | pp4fpga-[1:8]                       |
|         |        | Kernel Optimization - Latency        | §   | §     | UCSD Lab ucsd-[1:5]                 | Xilinx HLx Examples xhls-[1:18]     |
|         |        | Kernel Optimization - Pipeline       | §   | §     | Cornell - ECE5775 cornell-[1:4]     |                                     |
| 5       | 27-Nov | Design Examples                      | §   | §     | pp4fpga-[1:8]                       |                                     |
|         |        | Application Cases                    | §   | §     | Xilinx HLx Examples xhls-[1:18]     | Xilinx Application Notes xapp-[1:13 |
| 6       | 11-Dec |                                      |     |       | All topics                          | Final Project                       |
|         |        |                                      |     |       |                                     | Refer to project resource weight=10 |
|         | 22-Jan | Final Project presentation           |     |       |                                     |                                     |



#### Lectures – Self-Paced

#### 1. Tools & Platform

- a. Introduction to PYNQ & Lab2
- b. Vitis OpenCL XRT and Lab3

#### 2. FPGA (Xilinx)

- a. Introduction to FPGA
- b. FPGA CLB
- c. FPGA Memory
- d. FPGA DSP
- e. FPGA Interconnect

#### 3. Concept of System Performance and Optimization

- a. Host Optimization
- b. Kernel Optimization

#### 4. HLS Development

- a. Introduction to High Level Synthesis
- b. Kernel IO Interface
- c. Kernel Optimization Area
- d. Kernel Optimization Latency
- e. Kernel Optimization Pipeline

#### 5. Design Examples and Application

- a. Design Examples
- b. Application Cases



### Platform/Tools for Labs

- Vivado HLS: kernel optimization
  - Vivado HLS C-sim, synthesis, co-sim, IP generation
  - Analyze resource, latency, timeline/scheduling, waveform
- PYNQ (MPSoC AXI): embedded system
  - Run HLS C-sim, co-sim, IP generation
  - Vivado IP integration, block design, bitstream generation
  - Download to Zedboard/PYNQ-Z2 and run Jupyter Notebook
- Vitis & AWS F1 (FPGA-PCIe): cloud application
  - Run HLS C-sim, co-sim, IP generation
  - Vitis: run SW emulation, HW emulation, bitstream generation
  - Upload to AWS, run the application (host code) on the host PC
  - Profile and analyze application performance



### Xilinx Tools & Exercise

- Exercises/tutorials provided to gain proficiency in design flow
  - Vivado HLS 2019.2
  - Vivado Design Suite 2019.2
  - Xilinx Vitis IDE/Makefile
  - AWS

Refer to "Xilinx Tool Flow.ppt"



#### Develop Basic Skills/Tools in the First Two Weeks

- Complete the following three labs in the first two weeks:
  - Lab#1 Tool installation and implementation flow
  - Lab#2 Application acceleration for embedded systems (PYNQ-Zedboard)
  - Lab#3 Application acceleration for cloud environments (Amazon)



# Lab/Assignment & Submission Criteria

#### Lab/Project reference resources

Refer to "HLS Lab Project Resources.xls": https://cool.ntu.edu.tw/courses/3773/modules/items/110289

#### Weight Categories

| Weight | Description | Submission |
|--------|-------------|------------|
| 1, 2   | Single item: exercise an optimization pragma or coding style; setup/running/analysis. Effort: 30 min to 1 hr | 1. Screen dump: HLS, latency, resource, I/O interface, timeline<br>2. Vitis summary, HLS synthesis_report<br>3. ppt/word: description of observations and learning |
| 3, 4   | Code hoisting, exercising multiple optimizations, comparative analysis. Effort: 2-3 hr | Same as for weights 1, 2 |
| 5-9    | Algorithm level: code hoisting, comparative analysis. Effort: days | 1. ppt & presentation: introduce domain knowledge/theorem, optimization method, comparison of optimization merits/trade-offs<br>2. GitHub submission for publication |
| 10     | Application level; needs domain knowledge/background. Effort: weeks. Candidate for final team project | Same as for weights 5-9 |

- Other lab/project proposals are welcome; a weight will be assigned.

#### Lab/Project References





# Lab/Project Resources

| LabName | ID  | Wt | Topic                                                                      | Description                       |
|---------|-----|----|----------------------------------------------------------------------------|-----------------------------------|
| pp4fpga |     |    | https://github.com/KastnerRG/pp4fpgas/tree/master/examples                 | Kastner pp4fpga textbook          |
|         | 1   | 3  | FIR                                                                        |                                   |
|         | 2   | 3  | CORDIC                                                                     |                                   |
|         | 3   | 3  | DFT                                                                        |                                   |
|         | 4   | 3  | Sparse Matrix Vector                                                       |                                   |
|         | 5   | 3  | Matrix Multiplication                                                      |                                   |
|         | 6   | 3  | Prefix Sum and Histogram                                                   |                                   |
|         | 7   | 3  | Video System                                                               |                                   |
|         | 8   | 5  | Huffman Encoding                                                           |                                   |
| h264    | 185 | 10 | https://github.com/adsc-hls/synthesizable_h264                             | H.264 Video Decoder               |
| swater  | 186 | 10 | https://github.com/necst/coursera-sdaccel-practice                         | Smith-Waterman - gene sequence    |
| cirrna  | 187 | 10 | https://github.com/necst/circFAXOHW18public                                | circular RNA aligner              |
| point5  | 188 | 10 | https://bitbucket.org/necst/xohw18_5points_public/src/master/              | five point relative pose problem  |
| sha256  | 189 | 5  | https://github.com/dowenberghmark/FPGA-SHA256                              | SHA256                            |
| beamf   | 190 | 5  | https://developer.xilinx.com/en/articles/beamforming-acceleration.html     | beamforming                       |
| ethash  | 191 | 10 | https://developer.xilinx.com/en/articles/part1-introduction-to-ethash.html | blockchain - hashing for Ethereum |
| mcarlo  | 192 | 10 | https://github.com/KitAway/FinancialModels_AmazonF1/tree/master            | Monte Carlo financial models      |
| profax  | 193 | 10 | https://bitbucket.org/necst/profax-src/src/master/                         | Protein Folding Algorithm         |
| graph   | 194 | 10 | https://github.com/Xtra-Computing/ThunderGP                                | Graph Processing                  |



### **Course Credits**

- Earn credits from the following:
  - Submit labs/assignments chosen from the lab/project references, or propose your own
    - Credit is based on the weight category
  - Class presentation (5 minutes)
    - Credit = weight category × quality factor (0.7 to 1.3), where quality reflects insight and presentation skill
  - Final project presentation (10 minutes)
    - Credit = weight (5 to 10) × quality factor (0.7 to 1.3), where quality reflects insight and presentation skill
- Where insight comes from:
  - Fully understanding the material
  - Deeper observation of the analysis reports
  - Trying out different optimizations, making trade-offs, and doing comparative analysis

